\newcommand{\bs}[1]{\mathbf{#1}}
motivating case study
GP regression
kernels
Görtler, J., Kehlbeck, R., & Deussen, O. (2019). A visual exploration of Gaussian processes. Distill, 4(4). doi:10.23915/distill.00017
Deisenroth, M., Luo, Y., & van der Wilk, M. (2020). A practical guide to Gaussian processes. https://infallible-thompson-49de36.netlify.app/.
Question: How quickly does a star rotate around its axis? This has to be answered accurately before we can search for exoplanets around the star.
Data: Telescopes make it possible to measure the brightness of a star over time. We can look for periodic patterns in the brightness.
Ideally the brightness would be a perfect sine wave. But brightness changes in more complex ways, both on the surface of the star and in the space between the star and the telescope.
How does flux vary as the star rotates?
Formulation.
Goal. Infer the rotation period P to understand stellar activity and improve exoplanet detection.
Modeling. We need a statistical model of functions that:
Definition. A Gaussian Process (GP) is a collection of random variables, any finite subset of which is jointly Gaussian
f \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))
This can be used to define a prior distribution over functions. \mathbf{f} = (f(x_1), \ldots, f(x_N))^\top \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K})
where \mu_i = m(x_i) and K_{ij} = k(x_i, x_j). We will set m(\mathbf{x}) = 0.
Classical statistics: estimate fixed-dimensional parameters \theta of a function.
GPs: estimate the entire function f.
Observation This seems impossible, since functions are infinite-dimensional. But we only observe f at finitely many points, \begin{align*} \mathbf{f} = \left(f\left(x_1\right), \dots, f(x_M)\right) \end{align*} and a GP gives us a prior for any choice of x_{1}, \dots, x_{M}.
This is what it means to define a “distribution over functions.”
Covariance matrix \mathbf{K}:
The kernel encodes our assumptions about function smoothness, periodicity, trends, etc.
Each sample is one plausible function from our prior.
Different kernels \to different classes of functions.
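To make the finite-marginal view concrete, here is a minimal sketch (using NumPy; `rbf_kernel` and all parameter values are our own illustrative choices) that draws sample functions from a zero-mean GP prior by sampling a multivariate normal at a finite grid of points:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, sigma_f=1.0):
    """Squared-exponential (RBF) kernel on 1-D inputs."""
    sq_dists = (x1[:, None] - x2[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * sq_dists / lengthscale**2)

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 100)                    # finite set of evaluation points
K = rbf_kernel(x, x) + 1e-9 * np.eye(len(x))  # small jitter for numerical stability
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
# each row of `samples` is one plausible function drawn from the prior
```

Changing `lengthscale`, or swapping in a different kernel function, changes the class of functions the samples come from.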
We have defined a prior over functions. Now: update beliefs given data \{x_i, y_i\}_{i=1}^N.
Prior f \sim \mathcal{GP}(0, k(\mathbf{x}, \mathbf{x'}))
Observation model y_i = f(x_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_n^2) \text{ independent}
Let \mathbf{x} = (x_1, \ldots, x_N) denote training locations. Write \mathbf{y} = (y_1, \ldots, y_N).
Claim: The joint distribution is \begin{bmatrix} \mathbf{f}_* \\ \mathbf{y} \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} \mathbf{K}_{**} & \mathbf{K}_{*} \\ \mathbf{K}_{*}^\top & \mathbf{K} + \sigma_n^2\mathbf{I} \end{bmatrix}\right)
Notation: Kernel matrices are K_{ij} = k(x_i, x_j) (train–train), (\mathbf{K}_*)_{ij} = k(x^*_i, x_j) (test–train), and (\mathbf{K}_{**})_{ij} = k(x^*_i, x^*_j) (test–test).
By definition of GP: \begin{bmatrix} \mathbf{f}_* \\ \mathbf{f} \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} \mathbf{K}_{**} & \mathbf{K}_* \\ \mathbf{K}_*^\top & \mathbf{K}\end{bmatrix}\right) where \mathbf{f} = (f(x_1), \ldots, f(x_N)).
Since \mathbf{y} = \mathbf{f} + \boldsymbol{\epsilon} is a sum of independent Gaussians, it is Gaussian. Using independence of \mathbf{f} and \boldsymbol{\epsilon}:
\text{Cov}(\mathbf{f}_*, \mathbf{y}) = \text{Cov}(\mathbf{f}_*, \mathbf{f}) = \mathbf{K}_*
\text{Cov}(\mathbf{y}, \mathbf{y}) = \text{Cov}(\mathbf{f} + \boldsymbol{\epsilon}, \mathbf{f} + \boldsymbol{\epsilon}) = \mathbf{K} + \sigma_n^2\mathbf{I}
Now we condition on observed \mathbf{y} to get a posterior for the f that generated the data. The Gaussian conditioning formula gives
p(\mathbf{f}_* \mid \mathbf{y}, \mathbf{x}, \mathbf{x}^*) = \mathcal{N}(\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*) where
\begin{align*} \boldsymbol{\mu}_* &= \mathbf{K}_{*}(\mathbf{K} + \sigma_n^2\mathbf{I})^{-1}\mathbf{y}\\ \boldsymbol{\Sigma}_* &= \mathbf{K}_{**} - \mathbf{K}_{*}(\mathbf{K} + \sigma_n^2\mathbf{I})^{-1}\mathbf{K}_{*}^\top \end{align*}
The next few slides interpret this formula.
\begin{align*} \boldsymbol{\mu}_*=\mathbf{K}_{*}\left(\mathbf{K}+\sigma_n^2 \mathbf{I}\right)^{-1} \mathbf{y} \end{align*}
If f^\ast and y were both scalars, this would simplify to \begin{align*} \mu_{f^\ast \vert y} = \frac{\operatorname{cov}\left(f^*, y\right)}{\sigma_y^{2}}y = \rho \frac{\sigma_{f^\ast}}{\sigma_y}y \end{align*}
\begin{align*} \boldsymbol{\Sigma}_* = \mathbf{K}_{**} - \mathbf{K}_{*}(\mathbf{K} + \sigma_n^2\mathbf{I})^{-1}\mathbf{K}_{*}^\top \end{align*}
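As a sketch of these conditioning formulas in code (NumPy; the helper names and the toy data are our own illustrative choices):

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, sigma_f=1.0):
    return sigma_f**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, sigma_n=0.1):
    """Posterior mean and covariance at x_test via the conditioning formulas."""
    K   = rbf_kernel(x_train, x_train)           # N x N
    Ks  = rbf_kernel(x_test, x_train)            # M x N, i.e. K_*
    Kss = rbf_kernel(x_test, x_test)             # M x M, i.e. K_**
    A = K + sigma_n**2 * np.eye(len(x_train))    # K + sigma_n^2 I
    mu = Ks @ np.linalg.solve(A, y_train)        # K_* A^{-1} y
    Sigma = Kss - Ks @ np.linalg.solve(A, Ks.T)  # K_** - K_* A^{-1} K_*^T
    return mu, Sigma

x_train = np.array([0.0, 1.0, 2.0])
y_train = np.sin(x_train)
mu, Sigma = gp_posterior(x_train, y_train, np.array([0.5, 3.0]))
# posterior variance is small near the data (x = 0.5) and larger far from it (x = 3.0)
```

Using `np.linalg.solve` instead of forming the inverse explicitly is the standard numerically stable choice.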
Discuss [GP Conditioning] part (a) in the exercise sheet. Self-assign groups – only one response per group needed.
Consider a GP with zero mean function and RBF kernel k(x, x') = \exp(-\|x - x'\|^2). You have two test points x^* = [0, 2]^\top at which you want predictions, but no training data.
- What are the mean vector and covariance matrix of the prior distribution p(\mathbf{f}_{*}) at the test points?
Linear regression makes predictions using
\begin{align*} \mathbb{E}[y | \mathbf{x}] = \mathbf{x}^\top \boldsymbol{\beta}. \end{align*}
Geometrically, predictions at nearby points should be similar.
What does “nearby” mean? In linear regression, \begin{align*} \mathbf{x}^\top \mathbf{x}' \end{align*} measures closeness.
Suppose \boldsymbol{\beta} \sim \mathcal{N}\left(\mathbf{0}, \sigma^2 \mathbf{I}\right). Then,
\begin{align*} \text{Cov}(y, y') &= \mathbb{E}[\mathbf{x}^\top \boldsymbol{\beta} \boldsymbol{\beta}^\top \mathbf{x}'] \\ &= \mathbf{x}^\top \mathbb{E}[\boldsymbol{\beta} \boldsymbol{\beta}^\top] \mathbf{x}' \\ &= \mathbf{x}^\top (\sigma^2 \mathbf{I}) \mathbf{x}' \\ &= \sigma^2 \mathbf{x}^\top \mathbf{x}' \end{align*}
Covariance between predictions depends on \mathbf{x}^\top \mathbf{x}'.
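This identity is easy to check numerically. The sketch below (our own toy setup; the specific vectors are arbitrary) samples many \beta vectors and compares the empirical covariance of the two predictions against \sigma^2 \mathbf{x}^\top \mathbf{x}':

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
x  = np.array([1.0, 0.0])
xp = np.array([1.0, 1.0])

# beta ~ N(0, sigma^2 I); predictions are y = x^T beta and y' = x'^T beta
beta = rng.normal(scale=sigma, size=(200_000, 2))
y, yp = beta @ x, beta @ xp

empirical = np.mean(y * yp)        # both predictions have mean zero
theoretical = sigma**2 * (x @ xp)  # = 4.0 for this choice of x, x'
# empirical is close to theoretical
```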
Kernel answer: Define closeness explicitly.
\begin{align*} k(\mathbf{x}, \mathbf{x}') = \text{how much } f\left(\mathbf{x}\right) \text{ and } f\left(\mathbf{x'}\right) \text{should covary} \end{align*}
The function k is called a “kernel”.
k_{\text{RBF}}(\bs{x}, \bs{x}') = \sigma_f^2 \exp\left(-\frac{\|\bs{x} - \bs{x}'\|^2}{2\ell^2}\right)
\text{Correlation} = \exp\left(-\frac{\tau^2}{2\ell^2}\right), \quad \tau = \|\bs{x} - \bs{x}'\|
Small \ell (relative to data range): correlation decays quickly, so samples vary rapidly.
Long lengthscale: correlation decays slowly, so samples vary smoothly.
\text{Correlation} = \exp\left(-\frac{\tau^2}{2\ell^2}\right), \quad \tau = \|\bs{x} - \bs{x}'\|
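Plugging in numbers makes the effect of \ell concrete (a small sketch; the helper name is ours):

```python
import numpy as np

def rbf_correlation(tau, lengthscale):
    """Correlation between f(x) and f(x') a distance tau apart, under the RBF kernel."""
    return np.exp(-tau**2 / (2 * lengthscale**2))

tau = 1.0
short = rbf_correlation(tau, lengthscale=0.5)  # exp(-2)   ~ 0.14: nearly independent
long_ = rbf_correlation(tau, lengthscale=2.0)  # exp(-1/8) ~ 0.88: strongly correlated
```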
k_{\text{per}}(\bs{x}, \bs{x}') = \sigma_f^2 \exp\left(-\frac{2\sin^2(\pi\|\bs{x}-\bs{x}'\|/p)}{\ell^2}\right)
Parameters: the period p (distance between repetitions) and the lengthscale \ell (smoothness within each period).
This will help us model stellar rotation.
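A quick sanity check of the formula above (a sketch; the function name and default hyperparameters are our own): two points exactly one period apart should be perfectly correlated.

```python
import numpy as np

def periodic_kernel(x1, x2, period=1.0, lengthscale=1.0, sigma_f=1.0):
    """Periodic kernel: the distance enters only through sin(pi * tau / p)."""
    tau = np.abs(x1[:, None] - x2[None, :])
    return sigma_f**2 * np.exp(-2 * np.sin(np.pi * tau / period)**2 / lengthscale**2)

x = np.array([0.0])
k_same = periodic_kernel(x, x + 1.0)  # points exactly one period apart
# k_same[0, 0] equals sigma_f^2: the function repeats exactly every period
```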
k_{\text{Matérn-3/2}}(\bs{x}, \bs{x}') = \sigma_f^2\left(1 + \frac{\sqrt{3}\tau}{\ell}\right)\exp\left(-\frac{\sqrt{3}\tau}{\ell}\right)
k_{\text{Matérn-5/2}}(\bs{x}, \bs{x}') = \sigma_f^2\left(1 + \frac{\sqrt{5}\tau}{\ell} + \frac{5\tau^2}{3\ell^2}\right)\exp\left(-\frac{\sqrt{5}\tau}{\ell}\right)
Unlike RBF: Finite differentiability
(You don’t need to memorize these formulas!)
Samples drawn with a Matérn kernel are often “rougher” and can be more realistic.
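For reference, the two Matérn covariances above as functions of the distance \tau (a sketch; the function names are ours):

```python
import numpy as np

def matern32(tau, lengthscale=1.0, sigma_f=1.0):
    a = np.sqrt(3.0) * tau / lengthscale
    return sigma_f**2 * (1 + a) * np.exp(-a)

def matern52(tau, lengthscale=1.0, sigma_f=1.0):
    a = np.sqrt(5.0) * tau / lengthscale
    return sigma_f**2 * (1 + a + a**2 / 3) * np.exp(-a)

# at tau = 0 both equal sigma_f^2; at tau = 3 (with ell = 1) they retain more
# correlation than the RBF kernel's exp(-4.5), since they decay more slowly
```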
If k_1 and k_2 are valid kernels, so are
\begin{align*} k_{\text{sum}}(\bs{x}, \bs{x}') &= k_1(\bs{x}, \bs{x}') + k_2(\bs{x}, \bs{x}') \quad (\text{superposition})\\ k_{\text{prod}}(\bs{x}, \bs{x}') &= k_1(\bs{x}, \bs{x}') \times k_2(\bs{x}, \bs{x}') \quad (\text{modulation}) \end{align*}
Addition is like two independent phenomena occurring at once.
Multiplication is like one pattern modulating another.
Since kernels are closed under these algebraic operations, they can be tailored to the specific problem of interest.
Kernel objects can be added and multiplied.
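As an illustration of modulation, one way to model quasi-periodic signals such as stellar brightness is to multiply a periodic kernel by a slowly decaying RBF envelope: the periodicity persists, but spot patterns can drift over many rotations. (A sketch; the hyperparameter values below are illustrative, not fitted.)

```python
import numpy as np

def rbf(tau, lengthscale):
    return np.exp(-tau**2 / (2 * lengthscale**2))

def periodic(tau, period, lengthscale):
    return np.exp(-2 * np.sin(np.pi * tau / period)**2 / lengthscale**2)

def quasi_periodic(tau, period=3.0, ell_per=1.0, ell_decay=10.0, sigma_f=1.0):
    """Product kernel: periodicity modulated by a slow RBF decay."""
    return sigma_f**2 * periodic(tau, period, ell_per) * rbf(tau, ell_decay)

k1  = quasi_periodic(3.0)   # one period apart: still strongly correlated
k10 = quasi_periodic(30.0)  # many periods apart: the envelope has decayed
```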
Respond to [Interactive Kernel Choice] parts (a) and (b) in the exercise sheet.